Mean Model Clustering

Authors

  • Arindam Banerjee
  • Joydeep Ghosh
Abstract

In this paper, we propose an effective and efficient model-based 2-clustering approach using Fisher kernel based similarities [2][3]. The Fisher kernel is derived from knowledge of the family of distributions generating the data. In the proposed approach, we refrain from estimating the parameters of the unknown mixture distribution generating the data with the EM algorithm. Instead, the Fisher kernel values are computed from the mean model, obtained by assuming that the entire data set has been generated from a single member of the family rather than a mixture, thereby systematically ignoring the so-called hidden random variables. The Fisher kernel similarities in terms of the mean model have an interesting property that turns out to be quite useful for data clustering. Consider a set of $N$ data points sampled independently from a mixture of two distributions with densities belonging to the parametric family $f(x; \theta)$, parameterized by $d$ real-valued variables $\theta = [\theta_1, \ldots, \theta_d] \in \Theta$, where $\Theta$ is an open subset of $\mathbb{R}^d$. The family of generative models $f(x; \theta)$ defines a $d$-dimensional Riemannian manifold $S$ whose natural Riemannian metric is given by the Fisher information matrix [1]. Let the two generating distributions be denoted by $f(x; \theta_+)$ and $f(x; \theta_-)$. Knowing full well that the data has actually been sampled from a mixture of two distributions, we intentionally assume that the entire data set has been sampled from a single distribution, i.e., the mean model $f(x; \theta_0)$, belonging to the same family. Note that the parameters $\theta_0$ can be obtained in almost all practical cases directly from the given data by a simple maximum likelihood estimation. The concept of a distribution-dependent Fisher kernel was introduced in [2]. Let $K_{\hat{\theta}_0}(\cdot, \cdot)$ represent the Fisher kernel with respect to the estimated mean model.
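As a minimal sketch of the mean-model idea, the following Python computes Fisher kernel values for one concrete family chosen purely for illustration: a univariate Gaussian with unknown mean and known variance. The family, the function names `fit_mean_model` and `fisher_kernel`, and the toy data are assumptions and do not come from the paper. For this family the score is $(x - \mu_0)/\sigma^2$ and the Fisher information is $1/\sigma^2$, so the kernel reduces to $(x - \mu_0)(y - \mu_0)/\sigma^2$.

```python
def fit_mean_model(data):
    """ML estimate of the mean of a single Gaussian fitted to ALL the data,
    deliberately ignoring the fact that the data come from a mixture."""
    return sum(data) / len(data)


def fisher_kernel(x, y, mu0, sigma2=1.0):
    """Fisher kernel K(x, y) = U(x) * I^{-1} * U(y) for the family N(mu, sigma2)
    with known variance sigma2, evaluated at the mean-model parameter mu0."""
    score_x = (x - mu0) / sigma2   # derivative of log f(x; mu) w.r.t. mu at mu0
    score_y = (y - mu0) / sigma2
    fisher_info = 1.0 / sigma2     # Fisher information of the mean parameter
    return score_x * score_y / fisher_info


# Toy data (hypothetical): three points near +2 and three points near -2.
plus_samples = [1.5, 2.0, 2.5]
minus_samples = [-2.5, -2.0, -1.5]

mu0 = fit_mean_model(plus_samples + minus_samples)
same = fisher_kernel(plus_samples[0], plus_samples[1], mu0)
cross = fisher_kernel(plus_samples[0], minus_samples[0], mu0)
print(mu0, same, cross)
```

In this toy case the estimated mean model sits between the two groups, so pairs from the same group get a positive kernel value and pairs from different groups get a negative one. This is the sign pattern the abstract describes in expectation.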
We show that if $x_1, x_2$ are samples drawn independently from $f(x; \theta_+)$, and $y_1, y_2$ are samples drawn independently from $f(x; \theta_-)$, then $\Pr[E[K_{\hat{\eta}_0}(x_1, x_2)] \geq 0] = \Pr[E[K_{\hat{\eta}_0}(y_1, y_2)] \geq 0] = 1$, where $\eta$ is a special coordinate system on the manifold $S$ related to $\theta$ by a one-one differentiable map and $\hat{\eta}_0 = \eta(\hat{\theta}_0)$, the expectation is over the joint distribution of the samples, and the probability is over the randomness of $\hat{\eta}_0$. Further, $\Pr[E[K_{\hat{\eta}_0}(x_1, y_1)] < 0] > 1 - \mathrm{tr}(G_{\eta_0} G_{\hat{\eta}_0}^{-1}) / (N r_0^2)$, where $G_{\eta_0}$ and $G_{\hat{\eta}_0}$ are respectively the Fisher information matrices at the points in the manifold corresponding to the true mean model and the estimated mean model, and $r_0$ is a positive constant. The lower bound on this probability goes to 1 as $N \to \infty$. Probabilistic lower bounds on the actual kernel values $K_{\hat{\eta}_0}(x_1, x_2)$ and $K_{\hat{\eta}_0}(y_1, y_2)$ being positive, and $K_{\hat{\eta}_0}(x_1, y_1)$ being negative, can be computed for a given family. Their values are primarily determined by how tightly the sufficient statistic of the parametric family under consideration concentrates around its expectation. Any clustering algorithm that can use the mean model Fisher kernel similarities to separate the data can serve as a mean model clustering scheme. We propose a simple graph-based mean model clustering scheme. Given the set of data points, sampled from either $f(x; \theta_+)$ or $f(x; \theta_-)$ with uniform priors, let $A$ denote the pairwise kernel similarity matrix of the points, so that $[A]_{ij} = K_{\hat{\eta}_0}(x_i, x_j)$. Considering $A$ to be the adjacency matrix of a graph whose vertices are the data points, a 2-partitioning of the graph following a min-cut objective gives the graph-based clustering of the data.
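The graph-based scheme can be sketched in Python on a tiny example. The abstract does not say which min-cut solver is used, so this sketch assumes exhaustive enumeration of all 2-partitions, which is exact but exponential in the number of points and only suitable for illustration. The univariate unit-variance Gaussian family, the helper names `kernel_matrix`, `cut_weight`, and `min_cut_bipartition`, and the data are likewise assumptions.

```python
from itertools import combinations


def kernel_matrix(points, sigma2=1.0):
    """Pairwise mean-model Fisher kernel matrix for the family N(mu, sigma2)
    with known variance: A[i][j] = (x_i - mu0) * (x_j - mu0) / sigma2."""
    mu0 = sum(points) / len(points)              # ML estimate of the mean model
    scores = [(x - mu0) / sigma2 for x in points]
    return [[si * sj * sigma2 for sj in scores] for si in scores]


def cut_weight(A, side):
    """Sum of edge weights crossing from `side` to its complement."""
    n = len(A)
    return sum(A[i][j] for i in side for j in range(n) if j not in side)


def min_cut_bipartition(A):
    """Exact min-cut 2-partition by brute force (both sides non-empty).
    Vertex 0 is pinned to one side so each partition is visited once."""
    n = len(A)
    best_side, best_weight = None, float("inf")
    for size in range(0, n - 1):                 # extra vertices joining vertex 0
        for rest in combinations(range(1, n), size):
            side = frozenset((0,) + rest)
            weight = cut_weight(A, side)
            if weight < best_weight:
                best_side, best_weight = side, weight
    return best_side, best_weight


# Toy data (hypothetical): interleaved points from two groups around +2 and -2.
points = [1.5, -2.5, 2.0, -2.0, 2.5, -1.5]
A = kernel_matrix(points)
side, weight = min_cut_bipartition(A)
labels = [1 if i in side else 0 for i in range(len(points))]
print(labels, weight)
```

Because cross-group similarities are negative, the minimum cut is the partition that severs as much negative weight as possible, which here separates the positive points from the negative ones. For realistic data sizes a scalable graph partitioning heuristic would be needed in place of the enumeration.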




Publication date: 2007